tidyverse I: dplyr;
gapmindertidyverseII: readr, ggplot2;
Public Data, WDI, WIR, etctidyverse III: tidyr, etc.; WDI, WIR,
etctidyverse IV; WDI, WIR, etclibrary(tidyverse)
library(readxl)
url_summary <- "https://wir2022.wid.world/www-site/uploads/2022/03/WIR2022TablesFigures-Summary.xlsx"
download.file(url = url_summary, destfile = "data/WIR2022s.xlsx")
excel_sheets("data/WIR2022s.xlsx")
[1] "Index" "F1" "F2" "F3" "F4" "F5." "F6"
[8] "F7" "F8" "F9" "F10" "F11" "F12" "F13"
[15] "F14" "F15" "T1" "data-F1" "data-F2" "data-F3" "data-F4"
[22] "data-F5" "data-F6" "data-F7" "data-F8" "data-F9" "data-F10" "data-F11"
[29] "data-F12" "data-F13." "data-F14." "data-F15"
df_f1 <- read_excel("data/WIR2022s.xlsx", sheet = "data-F1")
df_f1
df_f1_rev %>%
ggplot(aes(x = cat, y = value, fill = group)) +
geom_col(position = "dodge")
tidyrTidy Your Data
“Data comes in many formats, but R prefers just one: tidy data.” — Garrett Grolemund
Data can come in a variety of formats, but one format is easier to use in R than the others. This format is known as tidy data. A data set is tidy if:
“Tidy data sets are all alike; but every messy data set is messy in its own way.” — Hadley Wickham
“all happy families are all alike; each unhappy family is unhappy in its own way” - Tolstoy’s Anna Karenina
tidyr
Basicspivot_longer()pivot_longer(data, cols = <columns to pivot into longer format>,
names_to = <name of the new character column>, # e.g. "group", "category", "class"
values_to = <name of the column the values of cells go to>) # e.g. "value", "n"
df_f1
(df_f1_rev <- df_f1 %>% pivot_longer(-1, names_to = "group", values_to = "value"))
df_f1_rev %>%
ggplot(aes(x = ...1, y = value, fill = group)) +
geom_col(position = "dodge")
df_f1_rev %>% filter(group != "Top 1%") %>%
ggplot() +
geom_col(aes(x = ...1, y = value, fill = group), position = "dodge") +
geom_text(aes(x = ...1, y = value, group = group,
label = scales::label_percent(accuracy=1)(value)),
position = position_dodge(width = 0.9)) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
labs(title = "Figure 1. Global income and wealth inequality, 2021",
x = "", y = "Share of total income or wealth", fill = "")
Interpretation: The global bottom 50% captures 8.5%
of total income measured at Purchasing Power Parity (PPP). The global
bottom 50% owns 2% of wealth (at Purchasing Power Parity). The global
top 10% owns 76% of total Household wealth and captures 52% of total
income in 2021. Note that top wealth holders are not necessarily top
income holders. Incomes are measured after the operation of pension and
unemployment systems and before taxes and transfers.
Sources and series: wir2022.wid.world/methodology.
pivot_wider()
In Console: vignette(“pivot”)
pivot_wider(data,
names_from = <name of the column (or columns) to get the name of the output column>,
values_from = <name of the column to get the value of the output>)
pivot_wider(data, names_from = group, values_from = value)
F4 and F13 are similar. Please use pivot_longer to tidy
the data and create charts.
df_f3 <- read_excel("data/WIR2022s.xlsx", sheet = "data-F3")
df_f3
df_f3$T10B50 %>% summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
5.394 10.958 15.676 17.635 19.838 139.591
df_f3 %>% ggplot() + geom_histogram(aes(T10B50))
df_f3 %>% arrange(desc(T10B50))
df_f3 %>%
mutate(`Top 10 Bottom 50 Ratio` = cut(T10B50,breaks = c(5, 12, 13, 16, 19,140),
include.lowest = FALSE))
world_map <- map_data("world")
df_f3 %>% mutate(`Top 10 Bottom 50 Ratio` = cut(T10B50,breaks = c(5, 12, 13, 16, 19,140),
include.lowest = FALSE)) %>%
ggplot(aes(map_id = Country)) +
geom_map(aes(fill = `Top 10 Bottom 50 Ratio`), map = world_map) +
expand_limits(x = world_map$long, y = world_map$lat)
world_map_wir <- world_map
world_map_wir$region[
world_map_wir$region=="Democratic Republic of the Congo"]<-"DR Congo"
world_map_wir$region[world_map_wir$region=="Republic of Congo"]<-"Congo"
world_map_wir$region[world_map_wir$region=="Ivory Coast"]<-"Cote dIvoire"
world_map_wir$region[world_map_wir$region=="Vietnam"]<-"Viet Nam"
world_map_wir$region[world_map_wir$region=="Russia"]<-"Russian Federation"
world_map_wir$region[world_map_wir$region=="South Korea"]<-"Korea"
world_map_wir$region[world_map_wir$region=="UK"]<-"United Kingdom"
world_map_wir$region[world_map_wir$region=="Brunei"]<-"Brunei Darussalam"
world_map_wir$region[world_map_wir$region=="Laos"]<-"Lao PDR"
world_map_wir$region[world_map_wir$region=="Cote dIvoire"]<-"Cote d'Ivoire"
world_map_wir$region[world_map_wir$region=="Cape Verde"]<- "Cabo Verde"
world_map_wir$region[world_map_wir$region=="Syria"]<- "Syrian Arab Republic"
world_map_wir$region[world_map_wir$region=="Trinidad"]<- "Trinidad and Tobago"
world_map_wir$region[world_map_wir$region=="Tobago"]<- "Trinidad and Tobago"
df_f3 %>% mutate(`Top 10 Bottom 50 Ratio` =
cut(T10B50, breaks = c(5, 12, 13, 16, 19,140), include.lowest = FALSE)) %>%
ggplot(aes(map_id = Country)) +
geom_map(aes(fill = `Top 10 Bottom 50 Ratio`),
map = world_map_wir) +
expand_limits(x = world_map_wir$long, y = world_map_wir$lat)
df_f3 %>% mutate(`Top 10 Bottom 50 Ratio` =
cut(T10B50,breaks = c(5, 12, 13, 16, 19,140), include.lowest = FALSE)) %>%
ggplot(aes(map_id = Country)) + geom_map(aes(fill = `Top 10 Bottom 50 Ratio`),
map = world_map_wir) + expand_limits(x = world_map_wir$long, y = world_map_wir$lat) +
coord_map("orthographic", orientation = c(25, 60, 0))
df_f3 %>% mutate(`Top 10 Bottom 50 Ratio` =
cut(T10B50,breaks = c(5, 12, 13, 16, 19,140), include.lowest = FALSE)) %>%
ggplot(aes(map_id = Country)) + geom_map(aes(fill = `Top 10 Bottom 50 Ratio`),
map = world_map_wir) + expand_limits(x = world_map_wir$long, y = world_map_wir$lat) +
coord_map("orthographic", orientation = c(15, -80, 0))
df_f3 %>% mutate(`Top 10 Bottom 50 Ratio` =
cut(T10B50,breaks = c(5, 12, 13, 16, 19,140), include.lowest = FALSE)) %>%
ggplot(aes(map_id = Country)) + geom_map(aes(fill = `Top 10 Bottom 50 Ratio`),
map = world_map_wir) +
expand_limits(x = world_map_wir$long, y = world_map_wir$lat)
df_f3 %>%
mutate(`Top 10 Bottom 50 Ratio` =
cut(T10B50,breaks = c(5, 12, 13, 16, 19,140), include.lowest = FALSE)) %>%
ggplot(aes(map_id = Country)) +
geom_map(aes(fill = `Top 10 Bottom 50 Ratio`), map = world_map_wir) +
expand_limits(x = world_map_wir$long, y = world_map_wir$lat) +
labs(title = "Figure 3. Top 10/Bottom 50 income gaps across the world, 2021",
x = "", y = "", fill = "Top 10/Bottom 50 ratio") +
theme(legend.position="bottom",
axis.text.x=element_blank(), axis.ticks.x=element_blank(),
axis.text.y=element_blank(), axis.ticks.y=element_blank()) +
scale_fill_brewer(palette='YlOrRd')
df_f3 %>% anti_join(world_map_wir, by = c("Country" = "region"))
Filtering joins
anti_join(x,y, ...): return all rows from x without a
match in y.semi_join(x,y, ...): return all rows from x with a
match in y.Check dplyr cheat sheet, and Posit Primers Tidy
Data.
F5: Global income inequality: T10/B50 ratio, 1820-2020 - fit curve
F9: Average annual wealth growth rate, 1995-2021 - fit curve + alpha
F7: Global income inequality, 1820-2020 - pivot + fit curve
F10: The share of wealth owned by the global 0.1% and billionaires, 2021 - pivot + fit curve
F6: Global income inequality: Between vs. Within country inequality (Theil index), 1820-2020 - pivot + area
F11: Top 1% vs bottom 50% wealth shares in Western Europe and the US, 1910-2020 - pivot name_sep + fit curve
F8: The rise of private versus the decline of public wealth in rich countries, 1970-2020 - rename + pivot + pivot + fit curve
F15: Per capita emissions acriss the world, 2019 - add row names + dodge
(df_f5 <- read_excel("data/WIR2022s.xlsx", sheet = "data-F5"))
df_f5 %>% ggplot(aes(x = y, y = t10b50)) + geom_line() + geom_smooth(span=0.25, se=FALSE)
df_f9 <- read_excel("data/WIR2022s.xlsx", sheet = "data-F9"); df_f9
df_f9 %>%
ggplot(aes(x = p, y = `Wealth growth 1995-2021`)) + geom_smooth(span = 0.30, se = FALSE)
df_f7 <- read_excel("data/WIR2022s.xlsx", sheet = "data-F7"); df_f7
df_f7 %>%
pivot_longer(cols = 2:4, names_to = "type", values_to = "value") %>%
ggplot(aes(x = y, y = value, color = type)) +
stat_smooth(formula = y~x, method = "loess", span = 0.25, se = FALSE)
df_f6 <- read_excel("data/WIR2022s.xlsx", sheet = "data-F6"); df_f6
New names:
df_f6 %>% select(year = "...1", 2:3) %>%
pivot_longer(cols = 2:3, names_to = "type", values_to = "value") %>%
mutate(types = factor(type,
levels = c("Within-country inequality", "Between-country inequality"))) %>%
ggplot(aes(x = year, y = value, fill = types)) +
geom_area() +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
scale_x_continuous(breaks = round(seq(1820, 2020, by = 20),1)) +
scale_fill_manual(values=rev(scales::hue_pal()(2)),
labels = function(x) str_wrap(x, width = 15)) +
labs(title = "Figure 6. Global income inequality:
\nBetween vs. within country inequality (Theil index), 1820-2020",
x = "", y = "Share of global inequality (% of total Theil index)", fill = "") +
annotate("text", x = 1850, y = 0.28,
label = stringr::str_wrap("1820: Between country inequality represents 11%
of global inequality", width = 20), size = 3) +
annotate("text", x = 1980, y = 0.70,
label = stringr::str_wrap("1980: Between country inequality represents 57%
of global inequality", width = 20), size = 3) +
annotate("text", x = 1990, y = 0.30,
label = stringr::str_wrap("2020: Between country inequality represents 32%
of global inequality", width = 20), size = 3)
df_f8 <- read_excel("data/WIR2022s.xlsx", sheet = "data-F8"); df_f8
df_f8 %>%
select(year, Germany_public = Germany, Germany_private = 'Germany (private)',
Spain_public = Spain, Spain_private = 'Spain (private)',
France_public = France, France_private = 'France (private)',
UK_public = UK, UK_private = 'UK (private)',
Japan_public = Japan, Japan_private = 'Japan (private)',
Norway_public = Norway, Norway_private = 'Norway (private)',
USA_public = USA, USA_private = 'USA (private)') %>%
pivot_longer(!year, names_to = c("country",".value"), names_sep = "_") %>%
pivot_longer(3:4, names_to = "type", values_to = "value") %>%
ggplot() +
stat_smooth(aes(x = year, y = value, color = country, linetype = type),
span = 0.25, se = FALSE, size=0.75) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
labs(title = "Figure 8. The rise of private versus the decline of public
wealth in rich countries, 1970-2020",
x = "", y = "wealth as as % of national income", color = "", type = "")
df_f8 %>%
select(year, Germany_public = Germany, Germany_private = 'Germany (private)',
Spain_public = Spain, Spain_private = 'Spain (private)',
France_public = France, France_private = 'France (private)',
UK_public = UK, UK_private = 'UK (private)',
Japan_public = Japan, Japan_private = 'Japan (private)',
Norway_public = Norway, Norway_private = 'Norway (private)',
USA_public = USA, USA_private = 'USA (private)')
df_f8 %>%
select(year, Germany_public = Germany, Germany_private = 'Germany (private)',
Spain_public = Spain, Spain_private = 'Spain (private)',
France_public = France, France_private = 'France (private)',
UK_public = UK, UK_private = 'UK (private)',
Japan_public = Japan, Japan_private = 'Japan (private)',
Norway_public = Norway, Norway_private = 'Norway (private)',
USA_public = USA, USA_private = 'USA (private)') %>%
pivot_longer(!year, names_to = c("country",".value"), names_sep = "_")
df_f8 %>%
select(year, Germany_public = Germany, Germany_private = 'Germany (private)',
Spain_public = Spain, Spain_private = 'Spain (private)',
France_public = France, France_private = 'France (private)',
UK_public = UK, UK_private = 'UK (private)',
Japan_public = Japan, Japan_private = 'Japan (private)',
Norway_public = Norway, Norway_private = 'Norway (private)',
USA_public = USA, USA_private = 'USA (private)') %>%
pivot_longer(!year, names_to = c("country",".value"), names_sep = "_") %>%
pivot_longer(3:4, names_to = "type", values_to = "value")
df_f8 %>%
select(year, Germany_public = Germany, Germany_private = 'Germany (private)',
Spain_public = Spain, Spain_private = 'Spain (private)',
France_public = France, France_private = 'France (private)',
UK_public = UK, UK_private = 'UK (private)',
Japan_public = Japan, Japan_private = 'Japan (private)',
Norway_public = Norway, Norway_private = 'Norway (private)',
USA_public = USA, USA_private = 'USA (private)') %>%
pivot_longer(!year, names_to = c("country",".value"), names_sep = "_") %>%
pivot_longer(3:4, names_to = "type", values_to = "value") %>%
ggplot() +
stat_smooth(aes(x = year, y = value, color = country, linetype = type),
formula = y~x, method = "loess", span = 0.25, se = FALSE, size=0.75) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
labs(title = "Figure 8. The rise of private versus the decline of public wealth
\nin rich countries, 1970-2020",
x = "", y = "wealth as as % of national income", color = "", type = "")
df_f15 <- read_excel("data/WIR2022s.xlsx", sheet = "data-F15"); df_f15
df_f15 %>% mutate(region = rep(regionWID[!is.na(regionWID)], each = 3)) %>%
select(region, group, tcap) %>%
ggplot(aes(x = region, y = tcap, fill = group)) +
geom_col(position = "dodge") +
scale_x_discrete(labels = function(x) stringr::str_wrap(x, width = 10)) +
labs(title = "Figure 15 Per capita emissions across the world, 2019",
x = "", y = "tonnes of CO2e per person per year", fill = "")
Repeat the process during your EDA.
image
In RStudio,
1.1. Project
project_name.Rproj in your
project folder (directory)1.2. data folder (directory) data
1.3. Move (or copy) data for the project to the data folder
data.data:
Press Files at the right bottom pane and click data, the
data folder.2.1. Project Notebook: Memo
Create an R Notebook: File > New File > R Notebook
Add descriptive title.
2.2. Setup Code Chunk
Create a code chunk and add packages to use in the project and RUN the code.
2.3. Choose Source or Visual editor mode,
and start editing Project Notebook
Set up Headings such as: About, Data, Analysis and Visualizations, Conclusions
Under About or Data, paste url of the sites and/or the data
2.4. Edit a new file by saving as for a report
Assign a name you can recall easily when you import data. You may need to reload the data with options.
3.1. Use a package:
write_csv(wdi_shortname, "data/wdi_shortname.csv")wdi_shortname <- read_csv("data/wdi_shortname.csv")3.2. Use readr to read from data, your data
folder
df1_shortname <- read_csv("data/file_name.csv")3.3. Use readr to read using the url of the data
df2_shortname <- read_csv("url_of_the_data")write_csv(df2_shortname, "data/df2_shortname.csv")df2_shortname <- read_csv("data/df2_shortname.csv")3.5. Use readxl to read Excel data. Add
library(readxl) in the setup and run.
df4 <- read_excel("data/file_name.xlsx", sheet = 1)References: Cheat Sheet - readr, readr, readxl
4.1. Look at the data: suppose df is the data frame
dt <- as_tibble(df)head(df), str(df),
summary(df), dt, glimpse(dt)4.2. Look at each variable
4.3. Variation of each data: suppose x1 is a column
name.
df %>% ggplot() + geom_histogram(aes(x1), bins = 30)
df %>% drop_na(x1): see the rows with a value in
x1. If the value is NA, the row is not shown.
df_wo_na <- df %>% drop_na(x1) if you want to use
only the rows without NA in x14.4. Use dpylr and tidyr to change column
names, tidy data, and/or summarize data
rename, select, filter,
arrange, mutate, pivot_longer(),
pivot_wider(), group_by and
summarizeReferences: Cheat Sheet - dplyr and tidyr,
dplyr, tidyr
5.1. In combination with Stap 4 - data transformation, try various data visualization.
5.2. Keep a record of what you can observe by the visualization
5.3. Edit the list of questions by adding or polishing
5.4. Select several informative chart and add options
5.5. Look at examples from the textbooks or teaching site to have better visualization
References: Cheat Sheet - ggplot2 ggplot2, ggplot2 book
EDA is an iterative cycle that helps you understand what your data says. When you do EDA, you:
Generate questions about your data
Search for answers by visualising, transforming, and/or modeling your data
Use what you learn to refine your questions and/or generate new questions
EDA is an important part of any data analysis. You can use EDA to make discoveries about the world; or you can use EDA to ensure the quality of your data, asking questions about whether the data meets your standards or not.
Government expenditure on education, total (% of GDP)
ID: SE.XPD.TOTL.GD.ZS
tidyr and WIR2022
a3_123456.nb.html by
replacing 123456 with your ID)
a3_123456.Rmd,a3_123456.nb.html,a3_123456.nb.html to Moodle.Choose a data with at least two categorical variables and at least two numerical variables.
Explore the data using visualization using
ggplot2
Observations based on your data visualization, and difficulties and questions encountered if any.
Due: 2023-01-23 23:59:00. Submit your R Notebook file in Moodle (The Fourth Assignment). Due on Monday!